Final Report on Brooklyn Housing Dataset (2003-2017)
Class: Data Science 1 with R (STAT 301-1)
1 Introduction
For this report, I explored a dataset labelled “Brooklyn Home Sales 2003 to 2017,” which contains information on all buildings, residential and nonresidential, sold in the New York borough of Brooklyn. The data originates from the NYC Department of Finance, and links to the data can be found under References (Section 5). Because the original and cleaned datasets are too large to upload to GitHub (see Section 2), I have added a Google Drive link to them here.
For this exploratory data analysis, I was primarily curious about what drives the number of building sales in the Brooklyn market, especially since buildings in this area tend to be more expensive than in the rest of the United States due to its urbanity and access to New York City as a whole. Additionally, I wished to look at what factors could have affected housing price increases in the area. Last of all, I wanted to look at which areas of Brooklyn are the most desired, based on the number of sales in this dataset, and which neighborhoods have become less popular.
2 Data Overview and Quality
For the original “Brooklyn Home Sales 2003 to 2017” dataset, there are 111 variables and 390,883 observations. There are 32 categorical variables, 71 numerical variables, 7 logical variables, and 1 date variable.
Many columns have missing values, particularly those associated with geographic mapping information, borough data (e.g. what borough the building is in), and some building information (like total units, building stories, etc.). Due to this missingness, I may be limited in my analysis of nonresidential buildings, as detailed building information could affect nonresidential building prices significantly. I do not believe the missing borough data will affect my analysis because all of the buildings sold are located in Brooklyn.
The dataset is not in the GitHub repository for this final project because it is too large (207.6 MB) to commit, so please refer to the hyperlink above in the Introduction (Section 1) for access to the original dataset and the cleaned dataset.
The cleaned dataset has 390,833 observations and 81 columns. There are 19 categorical variables, 61 numeric variables, and 1 date variable.
3 Explorations
3.1 How many buildings are being sold in Brooklyn over time?
Before exploring this question, I wanted to check the different types of buildings being sold in Brooklyn, because my initial hypothesis is that the building type can greatly affect other data about the building. The variable tax_class_at_sale categorizes the type of building sold into 4 categories according to the state of New York:
- (Class 1): Includes most residential property of up to three units (such as one-, two-, and three-family homes and small stores or offices with one or two attached apartments), vacant land that is zoned for residential use, and most condominiums that are not more than three stories.
- (Class 2): Includes all other property that is primarily residential, such as cooperatives and condominiums.
- (Class 3): Includes property with equipment owned by a gas, telephone or electric company.
- (Class 4): Includes all other properties not included in classes 1, 2, and 3, such as offices, factories, warehouses, garage buildings, etc.
According to Figure 1 (a), the most actively sold buildings are primarily residential buildings in classes 1 and 2, with class 4 buildings coming third and class 3 buildings rarely being sold. Additionally, class 4 buildings appear to cause large skews in both price and square feet: according to Figure 1 (b) and Figure 1 (c), the majority of the larger and more expensive buildings were class 4.
Thus, I will separate my dataset into residential (class 1 and class 2) buildings and non-residential (class 3 and class 4) buildings, with the majority of my analysis focusing on the residential dataset.
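A minimal base-R sketch of this split, shown on a toy data frame. Only tax_class_at_sale is a variable named in this report; the sale_price column and all values are made up for illustration, and this is not necessarily the code used to produce the figures.

```r
# Toy stand-in for the cleaned dataset (values are invented for illustration).
sales <- data.frame(
  tax_class_at_sale = c(1, 2, 4, 1, 3, 2),
  sale_price        = c(450000, 600000, 2500000, 380000, 90000, 710000)  # assumed column name
)

# Classes 1 and 2 are treated as residential; classes 3 and 4 as non-residential.
is_residential <- sales$tax_class_at_sale %in% c(1, 2)

residential     <- sales[is_residential, ]
non_residential <- sales[!is_residential, ]

nrow(residential)      # number of residential sales in the toy data
nrow(non_residential)  # number of non-residential sales in the toy data
```

Keeping the logical vector separate makes it easy to confirm that every row lands in exactly one of the two subsets.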
After separating the dataset, I wanted to look at how the number of building sales has changed year by year, to see whether there were any downturns or periods of relatively high sales.
In both Figure 2 (a) and Figure 2 (b), sales in Brooklyn were highest from 2003 to 2006 and dipped significantly from 2007 to 2010, with a sharper dip for residential buildings than for non-residential buildings. From 2011 to 2017, building sales appear to have recovered. It is worth checking whether these three time periods followed the same trends. For example, one immediate question I had was whether the decrease in sales coincided with an increase or a decrease in prices in the building market. To analyze this, I made the following line plots, which show the average sale prices for buildings sold in Brooklyn, grouped by year.
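The yearly summaries behind plots like these could be computed roughly as follows. This is a base-R sketch on invented toy data; the column names year_of_sale and sale_price are assumptions, as is the use of cut() to label the three periods.

```r
# Toy data (invented values); year_of_sale and sale_price are assumed column names.
sales <- data.frame(
  year_of_sale = c(2004, 2004, 2005, 2008, 2009, 2012, 2016, 2016),
  sale_price   = c(400000, 500000, 450000, 350000, 300000, 600000, 700000, 900000)
)

# Label the three periods discussed in the text.
# With the default right = TRUE the intervals are (2002, 2006], (2006, 2010], (2010, 2017].
sales$period <- cut(
  sales$year_of_sale,
  breaks = c(2002, 2006, 2010, 2017),
  labels = c("2003-2006", "2007-2010", "2011-2017")
)

# Number of sales per year, and mean sale price per year and per period.
sales_per_year        <- table(sales$year_of_sale)
mean_price_per_year   <- tapply(sales$sale_price, sales$year_of_sale, mean)
mean_price_per_period <- tapply(sales$sale_price, sales$period, mean)
```

Comparing sales_per_year with mean_price_per_year side by side shows whether a drop in the number of sales lines up with a drop or a rise in average price.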
Looking at Figure 3 (a) and Figure 3 (b), it appears that the average sale price of buildings in Brooklyn actually decreased in the 2007-2010 range, which means that my hypothesis that fewer sales would coincide with higher prices was incorrect. To investigate further, let us look at the price ranges of the buildings sold during the three time periods of 2003-2006, 2007-2010, and 2011-2017.
In Figure 4 (a), Figure 4 (b), and Figure 4 (c), the most common selling price in Brooklyn falls in the $300k to $500k range, and this is most prominent in Figure 4 (a). In Figure 4 (b), however, the number of homes sold slightly above the $300k to $500k range decreased significantly, which could have caused the decrease in average price for this time period. Within the 2011 to 2017 range, there has been a more subtle increase in properties sold in the $700k to $1m range even though buildings in the $300k to $500k range remain the most common, which could explain the large increases in residential building prices over time.
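A comparison of price ranges across periods like the one above can be sketched by binning prices and tabulating them by period. The $300k to $500k and $700k to $1m brackets come from the text; the other bracket boundaries, the column names, and the toy values are assumptions.

```r
# Toy data (invented values); period and sale_price are assumed column names.
sales <- data.frame(
  period     = c("2003-2006", "2003-2006", "2007-2010", "2011-2017", "2011-2017"),
  sale_price = c(350000, 480000, 320000, 450000, 850000)
)

# Price brackets; each interval is open on the left and closed on the right.
price_breaks <- c(0, 300000, 500000, 700000, 1000000, Inf)
price_labels <- c("<$300k", "$300k-$500k", "$500k-$700k", "$700k-$1m", ">$1m")

sales$price_range <- cut(sales$sale_price, breaks = price_breaks, labels = price_labels)

# Count of sales per period and bracket, then the share of each bracket within a period.
range_counts <- table(sales$period, sales$price_range)
range_shares <- prop.table(range_counts, margin = 1)
```

Looking at shares rather than raw counts keeps the periods comparable even though they cover different numbers of years and sales.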
3.2 What affects the price for buildings?
For the next part of my EDA, I wanted to look at what affects the price of buildings in Brooklyn. First off, I wanted to analyze the relationship between square feet and price using a scatter plot.
In Figure 5, the linear fit line shows a general positive relationship between price and square feet, meaning that the more square feet a property has, the more expensive it tends to be, which was expected. However, I also wanted to see whether residential buildings, which as seen in Figure 3 (a) are getting more expensive over time on average, are also getting larger in square feet.
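The kind of linear fit drawn in Figure 5 can be reproduced numerically with lm(). This is a sketch on invented toy data; gross_sqft and sale_price are assumed column names, and the fitted slope below describes only the toy values, not the Brooklyn data.

```r
# Toy data (invented values); gross_sqft and sale_price are assumed column names.
homes <- data.frame(
  gross_sqft = c(1000, 1500, 2000, 2500, 3000),
  sale_price = c(300000, 420000, 510000, 640000, 730000)
)

# Ordinary least squares fit of price on square feet.
fit <- lm(sale_price ~ gross_sqft, data = homes)

# The slope is the estimated change in price per additional square foot.
slope <- coef(fit)[["gross_sqft"]]

# The correlation gives a unit-free measure of how linear the relationship is.
r <- cor(homes$gross_sqft, homes$sale_price)
```

A positive slope corresponds to the upward-sloping fit line in the scatter plot; in ggplot2 the same line can be drawn with geom_smooth(method = "lm").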
Another variable I wanted to look at was prox_code, which identifies the proximity of the property to another property. prox_code is split into three categories: detached, semi-attached, and attached.
Figure 7 (a) reveals that on average, detached and attached buildings were at around the same price points from 2003 to 2017, while semi-attached buildings tended to be cheaper than other properties. However, in Figure 7 (b), semi-attached buildings are the second most sold, with attached buildings being the most sold and detached buildings the least sold. This may indicate that the lower prices of semi-attached properties lead to more sales or transfers of these properties.
One factor I thought would be a major determinant of building price was the age of the building, since older buildings could be of worse quality and therefore less desired. First off, let’s look at the distribution of buildings sold from 2003 to 2017 by age.
Based on Figure 8, most of the residential buildings sold in Brooklyn were more than 75 years old, although a large number of the buildings sold were 0 to 15 years old. Next, I wanted to see whether age strongly affected price.
Analyzing Figure 9, it appears that the newer buildings (0 to 15 years old) are around the same price as the highly popular 75 to 90 and 90 to 105 year ranges, yet are not sold as often. There is also an interesting dip in average building prices for the 25 to 75 year range, and a somewhat exponential growth in building price after 105 years, which could be due to the limited number of buildings that old in the sample.
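An age-group comparison like this one can be sketched by deriving age at sale and binning it in 15-year groups, matching the 0 to 15, 75 to 90, and 90 to 105 ranges mentioned above. The column names year_built and year_of_sale, the definition of age as their difference, and all values are assumptions for illustration.

```r
# Toy data (invented values); year_built, year_of_sale, sale_price are assumed column names.
homes <- data.frame(
  year_built   = c(2000, 1995, 1960, 1925, 1920, 1899),
  year_of_sale = c(2005, 2010, 2010, 2005, 2015, 2016),
  sale_price   = c(600000, 580000, 400000, 620000, 650000, 900000)
)

# Age of the building at the time of sale.
homes$age <- homes$year_of_sale - homes$year_built

# 15-year bins: [0,15), [15,30), ..., [120,135].
homes$age_group <- cut(
  homes$age,
  breaks = seq(0, 135, by = 15),
  right = FALSE,
  include.lowest = TRUE
)

# Mean sale price and number of sales per age group (empty groups are dropped).
mean_by_age  <- aggregate(sale_price ~ age_group, data = homes, FUN = mean)
count_by_age <- aggregate(sale_price ~ age_group, data = homes, FUN = length)
```

Reporting the count next to each mean helps flag age groups, such as the very oldest buildings, where the average rests on only a few sales.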
3.3 What neighborhoods are the most popular in Brooklyn?
Another aspect of the data that I wanted to analyze is which neighborhoods are the most popular in Brooklyn and where people tend to buy property. There are an estimated 77 neighborhoods in Brooklyn, although the dataset only lists 61. A rough map of these neighborhoods can be seen below:
First off, I wanted to look at the neighborhoods with the highest number of residential buildings sold from 2003 to 2017, including the mean price of these properties.
From Figure 10, it appears that the most popular neighborhoods are not the most expensive ones: the only neighborhood appearing in both the top 10 for number of residential buildings sold and the top 10 for average building price is Park Slope. Rather, these active neighborhoods tend to be lower in price than the overall averages seen in Figure 3 (a). This may indicate that properties in these neighborhoods change ownership often, or at least that the neighborhoods might not be seen as places to stay “permanently.”
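A ranking of neighborhoods by number of sales, with the mean price alongside, can be sketched as follows. The neighborhood labels "A", "B", and "C" and all prices are placeholders, and the column names are assumptions; the real dataset lists 61 neighborhoods.

```r
# Toy data (placeholder neighborhoods and invented prices).
sales <- data.frame(
  neighborhood = c("A", "A", "A", "B", "B", "C"),
  sale_price   = c(400000, 500000, 450000, 900000, 1100000, 550000)
)

# Number of sales and mean sale price per neighborhood.
# Both tapply() calls group by the same column, so their results are in the same order.
n_sold     <- tapply(sales$sale_price, sales$neighborhood, length)
mean_price <- tapply(sales$sale_price, sales$neighborhood, mean)

neighborhood_summary <- data.frame(
  neighborhood = names(n_sold),
  n_sold       = as.vector(n_sold),
  mean_price   = as.vector(mean_price)
)

# Sort from most to least sold and keep the top entries (top 10 in the report, 2 here).
neighborhood_summary <- neighborhood_summary[order(-neighborhood_summary$n_sold), ]
top_sold <- head(neighborhood_summary, 2)
```

Sorting the same summary by mean_price instead gives the most expensive neighborhoods, so the overlap between the two rankings can be checked with intersect().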
I also wanted to see if the average square feet of the buildings in these neighborhoods played a role in their popularity. In the following graph, each dot represents a neighborhood in Brooklyn.
One interesting observation is that the average square feet of many expensive (and less sold) neighborhoods is much lower than the average square feet of the neighborhoods with more sales. The neighborhoods with more sales cluster around the 1500 to 2500 average square feet range, which may indicate that having more space to live in can lead to more sales.
Next, I wanted to see if the most popular neighborhoods changed depending on the time period.
Looking at Figure 12 (a), Figure 12 (b), and Figure 12 (c), the neighborhoods of Bedford Stuyvesant, East New York, Bay Ridge, and Borough Park have the most residential properties sold in Brooklyn, which makes sense as properties in these neighborhoods are generally cheaper than the overall average prices for residential housing in Brooklyn. However, comparing Figure 12 (a) and Figure 10, the neighborhood of Crown Heights appears to be an outlier, as it is relatively higher in price ($519,171.17) yet is also getting increasingly popular.
4 Conclusions
After going through this EDA, I learned from the first section that the majority of buildings sold in Brooklyn are residential, with the number of sales decreasing from 2007 to 2010 relative to 2003 to 2006 and slightly recovering in the 2011 to 2017 range. After the 2007 to 2010 period, the average sale prices for both residential and non-residential buildings increased significantly, with average prices for residential buildings in 2017 hovering around $700,000, compared with around $450,000 in 2007. I was surprised by the decreases in both sales and average prices of buildings from 2007 to 2010, since I expected that buildings would be sold less often because they were too expensive.
In the second section of the EDA, I learned that despite gross square feet being positively correlated with sale price to some degree, there was not a significant increase in the square feet of the residential buildings sold from 2003 to 2006 in comparison to 2011 to 2017, even with the significant increases in price. I found this surprising because I expected that if one paid for a more expensive house, the house would be larger. Additionally, semi-attached residential buildings appear to be the cheapest on average, in comparison to attached and detached buildings. Last of all, I learned that the most expensive and most sold residential buildings were typically young (0 to 15 years old) or quite old (75 to 100 years old), while middle-aged buildings (40 to 60 years old) were relatively the cheapest. I was surprised that middle-aged buildings were cheaper than much older buildings, since I thought that price would decline consistently with building age.
In the third and final section of my EDA, I learned that the neighborhoods with the most residential properties sold had relatively lower average sale prices and higher average square feet than the more expensive neighborhoods with fewer sales. Additionally, the neighborhoods with the most residential properties sold tended to remain the same across all three time periods of 2003 to 2006, 2007 to 2010, and 2011 to 2017. Specifically, Bedford Stuyvesant, East New York, Borough Park, and Bay Ridge were in the top five neighborhoods quite often, which suggests that their price point and size could be quite attractive to buyers. I expected the lower priced neighborhoods to be the most sold, as there would be less of a barrier to buying there, although I was surprised by the lower average square feet of buildings in the expensive neighborhoods, since I thought a more expensive neighborhood would have more space.
Some additional points for further exploration would be adjusting sale prices for inflation to see whether the values are actually starkly different, and analyzing more economic data that could be compared with trends in housing, like unemployment or GDP. Additionally, having more up-to-date and complete geographic data for the addresses given in this dataset could allow for a visual mapping of the movement of sales over time with a Shiny app and slider. Lastly, adding data from 2018 to 2023 (as this dataset is being constantly added to) would allow for an interesting look into how building sales were affected by COVID-19, similar to how building sales were affected by the recession from 2007 to 2010.
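As a sketch of the inflation adjustment mentioned above, a nominal price can be rescaled by the ratio of a price index in a chosen base year to the index in the year of sale. The function name adjust_for_inflation and the index values below are hypothetical placeholders, not real CPI figures; a real analysis would substitute a published index.

```r
# Placeholder price index by year; these are NOT real CPI values.
price_index <- c("2007" = 100, "2017" = 120)

# Convert nominal prices into base-year dollars:
#   real price = nominal price * index(base year) / index(sale year)
adjust_for_inflation <- function(nominal_price, sale_year, base_year, index) {
  base_value <- index[[as.character(base_year)]]
  sale_value <- unname(index[as.character(sale_year)])
  nominal_price * base_value / sale_value
}

# Express a 2007 sale and a 2017 sale in 2017 dollars.
real_prices <- adjust_for_inflation(
  nominal_price = c(450000, 700000),
  sale_year     = c(2007, 2017),
  base_year     = 2017,
  index         = price_index
)
```

With this placeholder index, a sale made in the base year is unchanged, while the earlier sale is scaled up, which narrows the gap between the two nominal prices.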
5 References
- “Annualized Sales Update.” NYC Department of Finance, New York City, www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page.
- Fitzgerald, Peter. “Brooklyn neighborhoods map.” https://commons.wikimedia.org/w/index.php?curid=7334362
- Wu, Tommy. “Brooklyn Home Sales, 2003 to 2017.” Kaggle, 15 Feb. 2018, www.kaggle.com/datasets/tianhwu/brooklynhomes2003to2017/data.